Data Overview

Here you introduce the data sources, the intent of the EDA, etc. The data source is https://www.kaggle.com/nickhould/craft-cans/data. data is in two tables, beers and breweries that contains information about different craft beers and craft breweries, respectively. Check codebook.xlsx for more info about the columns.

Load Data

library(tidyverse)
library(gridExtra)

beers<-read.csv("craft-cans/beers.csv")
breweries<-read.csv("craft-cans/breweries.csv")

Data Cleaning

Here you explore data structure and clean data if necessary. Note data formats, types and meaning

summary(beers)
##        X               abv               ibu               id        
##  Min.   :   0.0   Min.   :0.00100   Min.   :  4.00   Min.   :   1.0  
##  1st Qu.: 602.2   1st Qu.:0.05000   1st Qu.: 21.00   1st Qu.: 808.2  
##  Median :1204.5   Median :0.05600   Median : 35.00   Median :1453.5  
##  Mean   :1204.5   Mean   :0.05977   Mean   : 42.71   Mean   :1431.1  
##  3rd Qu.:1806.8   3rd Qu.:0.06700   3rd Qu.: 64.00   3rd Qu.:2075.8  
##  Max.   :2409.0   Max.   :0.12800   Max.   :138.00   Max.   :2692.0  
##                   NA's   :62        NA's   :1005                     
##                      name                                 style     
##  Nonstop Hef Hop       :  12   American IPA                  : 424  
##  Dale's Pale Ale       :   6   American Pale Ale (APA)       : 245  
##  Oktoberfest           :   6   American Amber / Red Ale      : 133  
##  Longboard Island Lager:   4   American Blonde Ale           : 108  
##  1327 Pod's ESB        :   3   American Double / Imperial IPA: 105  
##  Boston Lager          :   3   American Pale Wheat Ale       :  97  
##  (Other)               :2376   (Other)                       :1298  
##    brewery_id        ounces     
##  Min.   :  0.0   Min.   : 8.40  
##  1st Qu.: 93.0   1st Qu.:12.00  
##  Median :205.0   Median :12.00  
##  Mean   :231.7   Mean   :13.59  
##  3rd Qu.:366.0   3rd Qu.:16.00  
##  Max.   :557.0   Max.   :32.00  
## 

As expected, Most American Craft Beers are IPAs, and often are an “American Style” Beer. There are interesting duplicates in the name category, so I’ll need to explore that to make sure there is some reason behind the duplication.

summary(breweries)
##        X                              name           city    
##  Min.   :  0.0   Blackrocks Brewery     :  2   Portland: 17  
##  1st Qu.:139.2   Blue Mountain Brewery  :  2   Boulder :  9  
##  Median :278.5   Lucette Brewing Company:  2   Chicago :  9  
##  Mean   :278.5   Oskar Blues Brewery    :  2   Seattle :  9  
##  3rd Qu.:417.8   Otter Creek Brewing    :  2   Austin  :  8  
##  Max.   :557.0   Sly Fox Brewing Company:  2   Denver  :  8  
##                  (Other)                :546   (Other) :498  
##      state    
##   CO    : 47  
##   CA    : 39  
##   MI    : 32  
##   OR    : 29  
##   TX    : 28  
##   PA    : 25  
##  (Other):358

Portland looks to be the top city of craft breweries, but Colorado is the top state. I’m suspicious of Portland’s top spot only because I know there are at least two major cultural cities with the name (ME and OR) - although Portland, OR certainly has a craft brew reputation. There are suspicious duplications in brewery names, so I’ll have to explore that a bit more before visualization as well.

Another note is that while beers has an unnecessary X column that can be dropped, breweries NEEDs the X because it is the primary key. However it should be renamed.

The brewery table needs to strip the whitespace from the state column as well.

beers<-dplyr::select(beers,-X)
breweries<-dplyr::rename(breweries,brewery_id=X)
summary(breweries)
##    brewery_id                         name           city    
##  Min.   :  0.0   Blackrocks Brewery     :  2   Portland: 17  
##  1st Qu.:139.2   Blue Mountain Brewery  :  2   Boulder :  9  
##  Median :278.5   Lucette Brewing Company:  2   Chicago :  9  
##  Mean   :278.5   Oskar Blues Brewery    :  2   Seattle :  9  
##  3rd Qu.:417.8   Otter Creek Brewing    :  2   Austin  :  8  
##  Max.   :557.0   Sly Fox Brewing Company:  2   Denver  :  8  
##                  (Other)                :546   (Other) :498  
##      state    
##   CO    : 47  
##   CA    : 39  
##   MI    : 32  
##   OR    : 29  
##   TX    : 28  
##   PA    : 25  
##  (Other):358
beers$brewery_id<-as.factor(beers$brewery_id)
breweries$brewery_id<-as.factor(breweries$brewery_id)
levels(breweries$state) <- trimws(levels(breweries$state))
summary(breweries)
##    brewery_id                       name           city         state    
##  0      :  1   Blackrocks Brewery     :  2   Portland: 17   CO     : 47  
##  1      :  1   Blue Mountain Brewery  :  2   Boulder :  9   CA     : 39  
##  2      :  1   Lucette Brewing Company:  2   Chicago :  9   MI     : 32  
##  3      :  1   Oskar Blues Brewery    :  2   Seattle :  9   OR     : 29  
##  4      :  1   Otter Creek Brewing    :  2   Austin  :  8   TX     : 28  
##  5      :  1   Sly Fox Brewing Company:  2   Denver  :  8   PA     : 25  
##  (Other):552   (Other)                :546   (Other) :498   (Other):358
beers%>%
  filter(name=='Nonstop Hef Hop')
## Warning: package 'bindrcpp' was built under R version 3.4.4
##      abv ibu   id            name                   style brewery_id
## 1  0.039  20 2400 Nonstop Hef Hop American Pale Wheat Ale         80
## 2  0.039  20 2399 Nonstop Hef Hop American Pale Wheat Ale         80
## 3  0.039  20 2398 Nonstop Hef Hop American Pale Wheat Ale         80
## 4  0.039  20 2397 Nonstop Hef Hop American Pale Wheat Ale         80
## 5  0.039  20 2396 Nonstop Hef Hop American Pale Wheat Ale         80
## 6  0.039  20 2395 Nonstop Hef Hop American Pale Wheat Ale         80
## 7  0.039  20 2394 Nonstop Hef Hop American Pale Wheat Ale         80
## 8  0.039  20 2393 Nonstop Hef Hop American Pale Wheat Ale         80
## 9  0.039  20 2392 Nonstop Hef Hop American Pale Wheat Ale         80
## 10 0.039  20 2391 Nonstop Hef Hop American Pale Wheat Ale         80
## 11 0.039  20 2390 Nonstop Hef Hop American Pale Wheat Ale         80
## 12 0.039  20 2389 Nonstop Hef Hop American Pale Wheat Ale         80
##    ounces
## 1      16
## 2      16
## 3      16
## 4      16
## 5      16
## 6      16
## 7      16
## 8      16
## 9      16
## 10     16
## 11     16
## 12     16

There are straight up duplicates. There is no distinguishing difference between the two. Duplicates can be removed.

beers%>%
  filter(name=="Dale's Pale Ale")
##     abv ibu   id            name                   style brewery_id ounces
## 1 0.065  65 1444 Dale's Pale Ale American Pale Ale (APA)        166   12.0
## 2 0.065  65 1252 Dale's Pale Ale American Pale Ale (APA)        166   12.0
## 3 0.065  65  955 Dale's Pale Ale American Pale Ale (APA)        166   19.2
## 4 0.065  65    1 Dale's Pale Ale American Pale Ale (APA)        166   12.0
## 5 0.065  65 1166 Dale's Pale Ale American Pale Ale (APA)        389   19.2
## 6 0.065  65 1065 Dale's Pale Ale American Pale Ale (APA)        389   12.0

This is interesting. There are multiple sizes AND potentially multiple locations for breweries. There are still duplicates, though.

beers%>%
  group_by(name)%>%
  filter(n()>1)%>%
  arrange(name)
## # A tibble: 187 x 7
## # Groups:   name [82]
##       abv   ibu    id name                      style    brewery_id ounces
##     <dbl> <dbl> <int> <fct>                     <fct>    <fct>       <dbl>
##  1 0.051     20  1813 #9                        Fruit /… 303            16
##  2 0.051     20   360 #9                        Fruit /… 303            12
##  3 0.0560    37   888 1327 Pod's ESB            Extra S… 380            12
##  4 0.0560    37   886 1327 Pod's ESB            Extra S… 380            12
##  5 0.0560    37   612 1327 Pod's ESB            Extra S… 380            12
##  6 0.054     30  2457 312 Urban Pale Ale        America… 88             16
##  7 0.054     30  2202 312 Urban Pale Ale        America… 88             12
##  8 0.042     18  2201 312 Urban Wheat Ale       America… 88             16
##  9 0.042     18  1829 312 Urban Wheat Ale       America… 88             12
## 10 0.05      NA  1293 Angry Orchard Crisp Apple Cider    434            16
## # ... with 177 more rows

Complete duplicates should be collapsed, but duplicates on only name need to be handled differently.

  summary(distinct(beers,abv,ibu,name,style,brewery_id,ounces,.keep_all=TRUE))
##       abv              ibu               id      
##  Min.   :0.0010   Min.   :  4.00   Min.   :   4  
##  1st Qu.:0.0500   1st Qu.: 21.75   1st Qu.: 814  
##  Median :0.0570   Median : 35.00   Median :1455  
##  Mean   :0.0599   Mean   : 42.79   Mean   :1432  
##  3rd Qu.:0.0680   3rd Qu.: 64.00   3rd Qu.:2072  
##  Max.   :0.1280   Max.   :138.00   Max.   :2692  
##  NA's   :62       NA's   :1001                   
##                      name                                 style     
##  Oktoberfest           :   6   American IPA                  : 419  
##  Dale's Pale Ale       :   4   American Pale Ale (APA)       : 242  
##  Longboard Island Lager:   3   American Amber / Red Ale      : 133  
##  #9                    :   2   American Blonde Ale           : 108  
##  312 Urban Pale Ale    :   2   American Double / Imperial IPA: 103  
##  312 Urban Wheat Ale   :   2   American Pale Wheat Ale       :  85  
##  (Other)               :2358   (Other)                       :1287  
##    brewery_id       ounces     
##  10     :  62   Min.   : 8.40  
##  25     :  38   1st Qu.:12.00  
##  166    :  31   Median :12.00  
##  141    :  25   Mean   :13.59  
##  46     :  24   3rd Qu.:16.00  
##  131    :  21   Max.   :32.00  
##  (Other):2176

This does well! Oktoberfest is a generic name, and Dale’s has 4 varietals (potentially).

beers<-distinct(beers,abv,ibu,name,style,brewery_id,ounces,.keep_all=TRUE)

We need to do the same with breweries.

breweries%>%
  group_by(name)%>%
  filter(n()>1)%>%
  arrange(name)
## # A tibble: 14 x 4
## # Groups:   name [7]
##    brewery_id name                    city         state
##    <fct>      <fct>                   <fct>        <fct>
##  1 12         Blackrocks Brewery      Marquette    MI   
##  2 95         Blackrocks Brewery      Marquette    MA   
##  3 382        Blue Mountain Brewery   Afton        VA   
##  4 414        Blue Mountain Brewery   Arrington    VA   
##  5 377        Lucette Brewing Company Menominee    WI   
##  6 456        Lucette Brewing Company Menominie    WI   
##  7 166        Oskar Blues Brewery     Longmont     CO   
##  8 503        Oskar Blues Brewery     Lyons        CO   
##  9 261        Otter Creek Brewing     Waterbury    VT   
## 10 275        Otter Creek Brewing     Middlebury   VT   
## 11 163        Sly Fox Brewing Company Phoenixville PA   
## 12 371        Sly Fox Brewing Company Pottstown    PA   
## 13 58         Summit Brewing Company  St. Paul     MN   
## 14 138        Summit Brewing Company  St Paul      MN

Blackrocks is super suspicious, as the city is THE SAME, but the state is different. Checking online, Blackrocks is ONLY in MI.

Blue Mountain is a legitimate split, BM Brewery splits their normal and specialty aged beers between the two locations

Lucette is clearly a misspelling. Hilariously, BOTH are misspellings. Lucette is in Menomonie, WI

Oskar Blues, like BM, actually has two locations (in reality they have several locations and breweries). However I doubt there is a type division

Otter Creek used to be in Waterbury before expanding and moving to Middlebury

Sly fox has two actual locations

Summit has a spelling discrepancy. We want to keep “St. Paul”.

In summary, Only three changes need to be made: to Blackrocks, Lucette, and Summit. Importantly, we need to preserve the joinability of the tables by altering the brewery_id on affected beers.

beers%>%
  filter(brewery_id %in% c(12,95))
##     abv ibu   id              name                   style brewery_id
## 1 0.073  NA 2449     Flying Sailor                Rye Beer         95
## 2 0.048  47 2634         Nordskye             American IPA         12
## 3 0.060  30 2153 North Third Stout  Foreign / Export Stout         12
## 4 0.052  NA 1953         Honey Lav American Pale Wheat Ale         12
## 5 0.068  NA 1496 Coconut Brown Ale      American Brown Ale         12
## 6 0.070  51 1481           51K IPA            American IPA         12
## 7 0.055  NA 1480     Grand Rabbits               Cream Ale         12
##   ounces
## 1     12
## 2     12
## 3     12
## 4     12
## 5     12
## 6     12
## 7     12
beers%>%
  mutate(brewery_id=replace(brewery_id,brewery_id==95,12))->beers
## Test
beers%>%
  filter(brewery_id %in% c(12,95))
##     abv ibu   id              name                   style brewery_id
## 1 0.073  NA 2449     Flying Sailor                Rye Beer         12
## 2 0.048  47 2634         Nordskye             American IPA         12
## 3 0.060  30 2153 North Third Stout  Foreign / Export Stout         12
## 4 0.052  NA 1953         Honey Lav American Pale Wheat Ale         12
## 5 0.068  NA 1496 Coconut Brown Ale      American Brown Ale         12
## 6 0.070  51 1481           51K IPA            American IPA         12
## 7 0.055  NA 1480     Grand Rabbits               Cream Ale         12
##   ounces
## 1     12
## 2     12
## 3     12
## 4     12
## 5     12
## 6     12
## 7     12
beers%>%
  filter(brewery_id %in% c(377,456))
##     abv ibu   id                  name                   style brewery_id
## 1 0.052  29 1571       Slow Hand Stout          American Stout        377
## 2 0.062  NA 1204        Hips Don't Lie              Hefeweizen        456
## 3 0.052  NA 1122   Ride Again Pale Ale American Pale Ale (APA)        456
## 4 0.048  NA  700 The Farmer's Daughter     American Blonde Ale        456
##   ounces
## 1     16
## 2     16
## 3     16
## 4     16
beers%>%
  mutate(brewery_id=replace(brewery_id,brewery_id==377,456))->beers
## Test
beers%>%
  filter(brewery_id %in% c(377,456))
##     abv ibu   id                  name                   style brewery_id
## 1 0.052  29 1571       Slow Hand Stout          American Stout        456
## 2 0.062  NA 1204        Hips Don't Lie              Hefeweizen        456
## 3 0.052  NA 1122   Ride Again Pale Ale American Pale Ale (APA)        456
## 4 0.048  NA  700 The Farmer's Daughter     American Blonde Ale        456
##   ounces
## 1     16
## 2     16
## 3     16
## 4     16
beers%>%
  filter(brewery_id %in% c(58,138))
##     abv ibu   id                   name
## 1 0.053  49 2352         Extra Pale Ale
## 2 0.053  40 2549             Make It So
## 3 0.047  55 2473    Hopvale Organic Ale
## 4 0.083 100 2415 Unchained #18 Hop Silo
##                                 style brewery_id ounces
## 1             American Pale Ale (APA)        138     12
## 2 Extra Special / Strong Bitter (ESB)         58     12
## 3             American Pale Ale (APA)         58     16
## 4      American Double / Imperial IPA         58     16
beers%>%
  mutate(brewery_id=replace(brewery_id,brewery_id==138,58))->beers
## Test
beers%>%
  filter(brewery_id %in% c(58,138))
##     abv ibu   id                   name
## 1 0.053  49 2352         Extra Pale Ale
## 2 0.053  40 2549             Make It So
## 3 0.047  55 2473    Hopvale Organic Ale
## 4 0.083 100 2415 Unchained #18 Hop Silo
##                                 style brewery_id ounces
## 1             American Pale Ale (APA)         58     12
## 2 Extra Special / Strong Bitter (ESB)         58     12
## 3             American Pale Ale (APA)         58     16
## 4      American Double / Imperial IPA         58     16

So we will want to remove the breweries with the following IDs: 95, 377, and 138. Additionally, We need to modify Lucette to have to correct spelling. Each has only one outlier

breweries%>%
  filter(!brewery_id %in% c(95,377,138))->breweries

## Test
breweries%>%
  group_by(name)%>%
  filter(n()>1)%>%
  arrange(name)
## # A tibble: 8 x 4
## # Groups:   name [4]
##   brewery_id name                    city         state
##   <fct>      <fct>                   <fct>        <fct>
## 1 382        Blue Mountain Brewery   Afton        VA   
## 2 414        Blue Mountain Brewery   Arrington    VA   
## 3 166        Oskar Blues Brewery     Longmont     CO   
## 4 503        Oskar Blues Brewery     Lyons        CO   
## 5 261        Otter Creek Brewing     Waterbury    VT   
## 6 275        Otter Creek Brewing     Middlebury   VT   
## 7 163        Sly Fox Brewing Company Phoenixville PA   
## 8 371        Sly Fox Brewing Company Pottstown    PA

Now just to fix Menomonie

breweries%>%
  filter(state=='WI')
##    brewery_id                         name          city state
## 1          32   James Page Brewing Company Stevens Point    WI
## 2          73                   Ale Asylum       Madison    WI
## 3         120          Horny Goat Brew Pub     Milwaukee    WI
## 4         131        Stevens Point Brewery Stevens Point    WI
## 5         134         Minhas Craft Brewery        Monroe    WI
## 6         146                 Angry Minnow       Hayward    WI
## 7         192              Capital Brewery     Middleton    WI
## 8         211    Wisconsin Brewing Company        Verona    WI
## 9         228  Geneva Lake Brewing Company   Lake Geneva    WI
## 10        284    Milwaukee Brewing Company     Milwaukee    WI
## 11        349           Hinterland Brewery     Green Bay    WI
## 12        406            Lazy Monk Brewing    Eau Claire    WI
## 13        425                    Ciderboys Stevens Point    WI
## 14        456      Lucette Brewing Company     Menominie    WI
## 15        486       Stillmank Beer Company     Green Bay    WI
## 16        498     Sprecher Brewing Company      Glendale    WI
## 17        529           Northwoods Brewpub    Eau Claire    WI
## 18        542 Angry Minnow Brewing Company       Hayward    WI
## 19        554              Dave's Brewfarm        Wilson    WI
breweries%>%
  filter(city=="Menominie")
##   brewery_id                    name      city state
## 1        456 Lucette Brewing Company Menominie    WI
breweries%>%
  mutate(city = dplyr::recode(city, "Menominie" = "Menomonie"))->breweries
## Test
breweries%>%
  filter(brewery_id==456)
##   brewery_id                    name      city state
## 1        456 Lucette Brewing Company Menomonie    WI

Lets take one more look at the data summaries before we break into visualization

summary(beers)
##       abv              ibu               id      
##  Min.   :0.0010   Min.   :  4.00   Min.   :   4  
##  1st Qu.:0.0500   1st Qu.: 21.75   1st Qu.: 814  
##  Median :0.0570   Median : 35.00   Median :1455  
##  Mean   :0.0599   Mean   : 42.79   Mean   :1432  
##  3rd Qu.:0.0680   3rd Qu.: 64.00   3rd Qu.:2072  
##  Max.   :0.1280   Max.   :138.00   Max.   :2692  
##  NA's   :62       NA's   :1001                   
##                      name                                 style     
##  Oktoberfest           :   6   American IPA                  : 419  
##  Dale's Pale Ale       :   4   American Pale Ale (APA)       : 242  
##  Longboard Island Lager:   3   American Amber / Red Ale      : 133  
##  #9                    :   2   American Blonde Ale           : 108  
##  312 Urban Pale Ale    :   2   American Double / Imperial IPA: 103  
##  312 Urban Wheat Ale   :   2   American Pale Wheat Ale       :  85  
##  (Other)               :2358   (Other)                       :1287  
##    brewery_id       ounces     
##  10     :  62   Min.   : 8.40  
##  25     :  38   1st Qu.:12.00  
##  166    :  31   Median :12.00  
##  141    :  25   Mean   :13.59  
##  46     :  24   3rd Qu.:16.00  
##  131    :  21   Max.   :32.00  
##  (Other):2176
str(beers)
## 'data.frame':    2377 obs. of  7 variables:
##  $ abv       : num  0.05 0.066 0.071 0.09 0.075 0.077 0.045 0.065 0.055 0.086 ...
##  $ ibu       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ id        : int  1436 2265 2264 2263 2262 2261 2260 2259 2258 2131 ...
##  $ name      : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1638 577 1704 1842 1819 268 1160 758 1093 486 ...
##  $ style     : Factor w/ 100 levels "","Abbey Single Ale",..: 19 18 16 12 16 80 18 22 18 12 ...
##  $ brewery_id: Factor w/ 558 levels "0","1","2","3",..: 409 178 178 178 178 178 178 178 178 178 ...
##  $ ounces    : num  12 12 12 12 12 12 12 12 12 12 ...
summary(breweries)
##    brewery_id                         name           city    
##  0      :  1   Blue Mountain Brewery    :  2   Portland: 17  
##  1      :  1   Oskar Blues Brewery      :  2   Boulder :  9  
##  2      :  1   Otter Creek Brewing      :  2   Chicago :  9  
##  3      :  1   Sly Fox Brewing Company  :  2   Seattle :  9  
##  4      :  1   10 Barrel Brewing Company:  1   Austin  :  8  
##  5      :  1   18th Street Brewery      :  1   Denver  :  8  
##  (Other):549   (Other)                  :545   (Other) :495  
##      state    
##  CO     : 47  
##  CA     : 39  
##  MI     : 32  
##  OR     : 29  
##  TX     : 28  
##  PA     : 25  
##  (Other):355
str(breweries)
## 'data.frame':    555 obs. of  4 variables:
##  $ brewery_id: Factor w/ 558 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name      : Factor w/ 551 levels "10 Barrel Brewing Company",..: 355 12 266 319 201 136 227 477 59 491 ...
##  $ city      : Factor w/ 384 levels "Abingdon","Abita Springs",..: 228 200 122 299 300 62 91 48 152 136 ...
##  $ state     : Factor w/ 51 levels "AK","AL","AR",..: 24 18 20 5 5 41 6 23 23 23 ...

Visualizations

Univariate Analysis

abvcolor<-aes(color=I('gray'),fill=I('tomato'))

ggplot(data=beers,aes(x=abv))+
  geom_histogram(binwidth=.002,abvcolor)
## Warning: Removed 62 rows containing non-finite values (stat_bin).

Many beers have ABV of .05. 5% is pretty standard beer alcohol content. It skews right - and I think this is because craft beers generally try to be unique and have something that makes it different than a domestic. Higher alcohol content is a simple way to justify the craft value.

ibucolor<-aes(color=I('gray'),fill=I('limegreen'))

ggplot(data=beers,aes(x=ibu))+
  geom_histogram(ibucolor,binwidth = 5)
## Warning: Removed 1001 rows containing non-finite values (stat_bin).

IBUs measure the bitterness. There was (maybe still is) a IPA craze in craft beer. It is a weird machismo thing to max out bitterness at the expense of balance. It is a common way to distinguish yourself - check the peak at 100. The otherwise bimodality is likely due to most beer types have IBUs under 25, but IPAs and other high-hopped beers seek for near 75.

ozcolor<-aes(color=I('gray'),fill=I('cadetblue'))

ggplot(data=beers,aes(x=ounces))+
  geom_bar(ozcolor)+
  scale_x_continuous(breaks=seq(0,32,4))

There are very common beer can sizes. 12 oz, 16 oz etc. There are some oddballs.

ggplot(data=beers,aes(x=reorder(style,style,function(x)+length(x))))+
  geom_bar()+
  scale_y_log10()+
  coord_flip()

Note the log scale.American style beers dominate the american craft brew scene. It is noteworthy that lagers are technically more difficult and expensive to make, and a lot of craft breweries brew ales. It is also noteworthy that there are around 10 beers without a style attached to them.

joineddata<-dplyr::left_join(beers,breweries, by="brewery_id",suffix=c(".beer",".brewer"))
summary(joineddata)
##       abv              ibu               id      
##  Min.   :0.0010   Min.   :  4.00   Min.   :   4  
##  1st Qu.:0.0500   1st Qu.: 21.75   1st Qu.: 814  
##  Median :0.0570   Median : 35.00   Median :1455  
##  Mean   :0.0599   Mean   : 42.79   Mean   :1432  
##  3rd Qu.:0.0680   3rd Qu.: 64.00   3rd Qu.:2072  
##  Max.   :0.1280   Max.   :138.00   Max.   :2692  
##  NA's   :62       NA's   :1001                   
##                   name.beer                               style     
##  Oktoberfest           :   6   American IPA                  : 419  
##  Dale's Pale Ale       :   4   American Pale Ale (APA)       : 242  
##  Longboard Island Lager:   3   American Amber / Red Ale      : 133  
##  #9                    :   2   American Blonde Ale           : 108  
##  312 Urban Pale Ale    :   2   American Double / Imperial IPA: 103  
##  312 Urban Wheat Ale   :   2   American Pale Wheat Ale       :  85  
##  (Other)               :2358   (Other)                       :1287  
##    brewery_id       ounces                          name.brewer  
##  10     :  62   Min.   : 8.40   Brewery Vivant            :  62  
##  25     :  38   1st Qu.:12.00   Oskar Blues Brewery       :  44  
##  166    :  31   Median :12.00   Sun King Brewing Company  :  38  
##  141    :  25   Mean   :13.59   Cigar City Brewing Company:  25  
##  46     :  24   3rd Qu.:16.00   Sixpoint Craft Ales       :  24  
##  131    :  21   Max.   :32.00   Stevens Point Brewery     :  21  
##  (Other):2176                   (Other)                   :2163  
##            city          state     
##  Grand Rapids:  66   CO     : 261  
##  Chicago     :  55   CA     : 181  
##  Portland    :  53   MI     : 163  
##  Indianapolis:  43   IN     : 139  
##  San Diego   :  42   TX     : 130  
##  Boulder     :  41   OR     : 114  
##  (Other)     :2077   (Other):1389
top20brew<-joineddata %>%
  count(brewery_id) %>%
  top_n(20)

ggplot(data=subset(joineddata,brewery_id %in% top20brew$brewery_id),
       aes(x=reorder(name.brewer,name.brewer, function(x)+length(x))))+
  geom_bar()+
  coord_flip()

Some breweries are going crazy pumping out different beers.

I don’t think looking at name will yield much, as the names are largely unique.

Now lets look at the breweries table closer.

statecolor=aes(color=I('gray'),fill=I('mediumpurple'))
ggplot(data=breweries, aes(x=reorder(state,state, function(x)+length(x))))+
  geom_bar(statecolor)+
  coord_flip()

Colorado is king! Colorado has a huge brewing history, as well as Michigan. California is just huge. Oregon has a reputation for craft breweries, especially in Portland.

Speaking of Portland, there are two important Portlands, so I want to check if duplicate city names in different states is a factor. likely just need to create a new variable that combines the two.

breweries %>%
  mutate(citystate=interaction(city,state,sep=", "))->breweries
summary(breweries)
##    brewery_id                         name           city    
##  0      :  1   Blue Mountain Brewery    :  2   Portland: 17  
##  1      :  1   Oskar Blues Brewery      :  2   Boulder :  9  
##  2      :  1   Otter Creek Brewing      :  2   Chicago :  9  
##  3      :  1   Sly Fox Brewing Company  :  2   Seattle :  9  
##  4      :  1   10 Barrel Brewing Company:  1   Austin  :  8  
##  5      :  1   18th Street Brewery      :  1   Denver  :  8  
##  (Other):549   (Other)                  :545   (Other) :495  
##      state             citystate  
##  CO     : 47   Portland, OR : 11  
##  CA     : 39   Boulder, CO  :  9  
##  MI     : 32   Chicago, IL  :  9  
##  OR     : 29   Seattle, WA  :  9  
##  TX     : 28   San Diego, CA:  8  
##  PA     : 25   Denver, CO   :  8  
##  (Other):355   (Other)      :501

We can see that the number of “Portland” cities is not equal to the number of “Portland, OR”, meaning that Portland, ME has a few craft breweries.

top10citystate<-breweries %>%
  count(citystate) %>%
  top_n(10)

ggplot(data=subset(breweries,citystate %in% top10citystate$citystate),
       aes(x=reorder(citystate,citystate, function(x)+length(x))))+
  geom_bar(statecolor)+
  coord_flip()

Both Portlands are in the top 10.

Univariate Summary

What is the structure of your dataset?

standard relational dataset.

What is/are the main feature(s) of interest in your dataset?

Things like ABV and IBU are important, as well as location.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

style will be interesting to correlate with ABV and IBU.

Did you create any new variables from existing variables in the dataset?

I created a citystate variable to account for same city names in different states

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There is a slight tail on abv, which makes sense. Some beers are high alcohol content by design. There is a bimodality (or tri-modality) in IBUs, which is likely tied to unique beer styles that are know for bitterness, like IPAs.

Bivariate Analysis

I should’ve just done this in the beginning, but I’ll work exclusively on the joined data now.

joineddata<-dplyr::left_join(beers,breweries, by="brewery_id",suffix=c(".beer",".brewer"))
summary(joineddata)
##       abv              ibu               id      
##  Min.   :0.0010   Min.   :  4.00   Min.   :   4  
##  1st Qu.:0.0500   1st Qu.: 21.75   1st Qu.: 814  
##  Median :0.0570   Median : 35.00   Median :1455  
##  Mean   :0.0599   Mean   : 42.79   Mean   :1432  
##  3rd Qu.:0.0680   3rd Qu.: 64.00   3rd Qu.:2072  
##  Max.   :0.1280   Max.   :138.00   Max.   :2692  
##  NA's   :62       NA's   :1001                   
##                   name.beer                               style     
##  Oktoberfest           :   6   American IPA                  : 419  
##  Dale's Pale Ale       :   4   American Pale Ale (APA)       : 242  
##  Longboard Island Lager:   3   American Amber / Red Ale      : 133  
##  #9                    :   2   American Blonde Ale           : 108  
##  312 Urban Pale Ale    :   2   American Double / Imperial IPA: 103  
##  312 Urban Wheat Ale   :   2   American Pale Wheat Ale       :  85  
##  (Other)               :2358   (Other)                       :1287  
##    brewery_id       ounces                          name.brewer  
##  10     :  62   Min.   : 8.40   Brewery Vivant            :  62  
##  25     :  38   1st Qu.:12.00   Oskar Blues Brewery       :  44  
##  166    :  31   Median :12.00   Sun King Brewing Company  :  38  
##  141    :  25   Mean   :13.59   Cigar City Brewing Company:  25  
##  46     :  24   3rd Qu.:16.00   Sixpoint Craft Ales       :  24  
##  131    :  21   Max.   :32.00   Stevens Point Brewery     :  21  
##  (Other):2176                   (Other)                   :2163  
##            city          state                 citystate   
##  Grand Rapids:  66   CO     : 261   Grand Rapids, MI:  66  
##  Chicago     :  55   CA     : 181   Chicago, IL     :  55  
##  Portland    :  53   MI     : 163   Indianapolis, IN:  43  
##  Indianapolis:  43   IN     : 139   San Diego, CA   :  42  
##  San Diego   :  42   TX     : 130   Boulder, CO     :  41  
##  Boulder     :  41   OR     : 114   Portland, OR    :  41  
##  (Other)     :2077   (Other):1389   (Other)         :2089

Start with a ggpairs, as it plots all the bivariates at once.

library(GGally)
library(scales)
library(memisc)

ggpairs(dplyr::select(joineddata,
                      -ends_with("id"),-starts_with("name."),-starts_with("city"),-style,-state),
        lower = list(continuous = wrap("points", shape = I('.'))),
        upper = list(combo = wrap("box", outlier.shape = I('.'))))

Unfortunately, most of the data has a huge number of factors and isn’t ameniable to pairplotting. But this is still useful.

ggplot(data=joineddata,aes(x=abv,y=ibu))+
  geom_point(alpha=.5)+
  geom_smooth(method="lm",formula=y~x)+
  geom_density2d()
## Warning: Removed 1001 rows containing non-finite values (stat_smooth).
## Warning: Removed 1001 rows containing non-finite values (stat_density2d).
## Warning: Removed 1001 rows containing missing values (geom_point).

with(joineddata,cor.test(abv,ibu))
## 
##  Pearson's product-moment correlation
## 
## data:  abv and ibu
## t = 33.284, df = 1374, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6377807 0.6963675
## sample estimates:
##       cor 
## 0.6681083

There is a correlation between ABV and IBU. There is no technical reason for these to be correlated, as alcohol content and ibu are adjusted with different components of beer. However, it is a competitive advantage in niche craft markets to be exceptional. the hoppiest and most alcoholic beer is a branding edge that a small craft brewery would strive to have.

ggplot(data=joineddata,aes(x=reorder(state,-abv,median,na.rm = TRUE), y=abv))+
  geom_jitter(alpha=.3,color='tomato')+
  geom_boxplot(alpha=.5,fill='tomato')+
  stat_summary(fun.y="mean",geom="point",color="green",shape=8,size=2)+
  coord_flip()
## Warning: Removed 62 rows containing non-finite values (stat_boxplot).
## Warning: Removed 62 rows containing non-finite values (stat_summary).
## Warning: Removed 62 rows containing missing values (geom_point).

Utah has a oddball beer law, saying beer must be 4% ABV max, whihc explains their location! Many of the states have pretty wide spreads.

ggplot(data=joineddata,aes(x=reorder(state,-ibu,median,na.rm = TRUE), y=ibu))+
  geom_jitter(alpha=.3,color='limegreen')+
  geom_boxplot(alpha=.5,fill='limegreen')+
  stat_summary(fun.y="mean",geom="point",color="red",shape=8,size=2)+
  coord_flip()
## Warning: Removed 1001 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1001 rows containing non-finite values (stat_summary).
## Warning: Removed 1001 rows containing missing values (geom_point).

There is an interesting narrative of having mostly midwestern states near the bottom of IBUs, and having southern and costal states near the top. it isnt a perfect narrative.

This needs to be loaded up into tableau for some interesting maps.

Another important thing to do is to actually make a region variable! If we categorize the states into the four regions: Northeast, South, Midwest, and West; and into divisions: New England, Mid-atlantic, East North Central, West North Central, South atlantic, East South Central, West South Central, Mountain, and Pacific; we may see cultural or regional differences with better clarity.

state<- unique(joineddata$state)
  
divisions<-data.frame(state)
divisions<-divisions%>%
  mutate(division=case_when(state %in% c("CT","ME","MA","NH","RI","VT") ~ "New England",
                            state %in% c("NJ","NY","PA") ~ "Mid-Atlantic",
                            state %in% c("WV","MD","DE","DC","VA","NC","SC","GA","FL") ~ "South Atlantic",
                            state %in% c("KY","TN","MS","AL") ~ "East South Central",
                            state %in% c("MI","OH","IN","IL","WI") ~ "East North Central",
                            state %in% c("MN","IA","MO","ND","SD","NE","KS") ~ "West North Central",
                            state %in% c("OK","AR","LA","TX") ~ "West South Central",
                            state %in% c("CO","NM","WY","MT","ID","UT","AZ","NV") ~ "Mountain",
                            state %in% c("WA","OR","CA","AK","HI") ~ "Pacific",
                            TRUE ~ "OTHER"))
divisions<-divisions%>%
  mutate(region=case_when(division %in% c("New England","Mid-Atlantic") ~ "Northeast",
                            division %in% c("South Atlantic","East South Central","West South Central") ~"South",
                            division %in% c("East North Central","West North Central") ~"Midwest",
                            division %in% c("Mountain","Pacific") ~ "West",
                            TRUE ~ "OTHER"))

Now we can reupdate the joineddata by joining our new table!

joineddata<-joineddata %>%
  left_join(divisions,by="state")

Lets look at divisions and regions for ABV and IBU

ggplot(data=joineddata,aes(x=reorder(division,-abv,median,na.rm = TRUE), y=abv))+
  geom_jitter(alpha=.3,color='tomato')+
  geom_boxplot(alpha=.5,fill='tomato')+
  stat_summary(fun.y="mean",geom="point",color="green",shape=8,size=2)+
  coord_flip()
## Warning: Removed 62 rows containing non-finite values (stat_boxplot).
## Warning: Removed 62 rows containing non-finite values (stat_summary).
## Warning: Removed 62 rows containing missing values (geom_point).

Every division seems to keep it pretty close to 6%, but it looks like the southwestern states like slightly lower abv than the great lakes. Maybe more ice beers in the colder climate?

ggplot(data=joineddata,aes(x=reorder(division,-ibu,median,na.rm = TRUE), y=ibu))+
  geom_jitter(alpha=.3,color='limegreen')+
  geom_boxplot(alpha=.5,fill='limegreen')+
  stat_summary(fun.y="mean",geom="point",color="red",shape=8,size=2)+
  coord_flip()
## Warning: Removed 1001 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1001 rows containing non-finite values (stat_summary).
## Warning: Removed 1001 rows containing missing values (geom_point).

This is neat, the pacific coast states have a huge range in IBUs, while the heartland states maybe don’t like the bitterness for the sake of bitterness.

ggplot(data=joineddata,aes(x=reorder(region,-abv,median,na.rm = TRUE), y=abv))+
  geom_jitter(alpha=.3,color='tomato')+
  geom_boxplot(alpha=.5,fill='tomato')+
  stat_summary(fun.y="mean",geom="point",color="green",shape=8,size=2)+
  coord_flip()
## Warning: Removed 62 rows containing non-finite values (stat_boxplot).
## Warning: Removed 62 rows containing non-finite values (stat_summary).
## Warning: Removed 62 rows containing missing values (geom_point).

the differences get even smaller. The west coast seems to be more experimental than the other areas. Likely just a factor of number of breweries/beers.

ggplot(data=joineddata,aes(x=reorder(region,-ibu,median,na.rm = TRUE), y=ibu))+
  geom_jitter(alpha=.3,color='limegreen')+
  geom_boxplot(alpha=.5,fill='limegreen')+
  stat_summary(fun.y="mean",geom="point",color="red",shape=8,size=2)+
  coord_flip()
## Warning: Removed 1001 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1001 rows containing non-finite values (stat_summary).
## Warning: Removed 1001 rows containing missing values (geom_point).

Looks like the same trend holds, the midwest likes the more balanced bitterness (influenced by budweiser in STL maybe?) while the west coast likes to experiment more.

Bivariate Summary

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Analysis

ggplot(data=joineddata,aes(x=abv,y=ibu,color=region))+
  geom_point(alpha=.5)+
  geom_smooth(method="lm",formula=y~x)
## Warning: Removed 1001 rows containing non-finite values (stat_smooth).
## Warning: Removed 1001 rows containing missing values (geom_point).

Basically no distinction within these regions, except that the south and the midwest have a few particularly high ABV choices.

ggplot(data=joineddata,aes(x=abv,y=ibu,color=division))+
  geom_point(alpha=.5)
## Warning: Removed 1001 rows containing missing values (geom_point).

Modeling Data

optional part, use a model to try to fit data.

# 
# m1 <- polr(quality ~ variable1, data = edadata, Hess=TRUE)
# m2 <- update(m1, ~ . + variable2)
# m3 <- update(m2, ~ . + variable3)
# 
# mtable(m1, m2, m3)

Multivariate Summary

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Final Plots

# copied from above

Highlight the most impactful and interesting plots from above, giving a standalone discription of the plot and what it shows.

Grand Summary and Reflection

Reflect on data quality - the difficulties and deficits of the data - and potential options to expand data collection to remedy deficits.

Reflect on data structure that appeals to common beliefs about the data, structure that challenges preconceived notions, and most importantly any surprising and un-thought-of structure that stands out.

Note on next steps (collect more data, further exploration, machine learning)